REMARKS

Claims 1-11, 13-22 and 24-47 are currently presented for examination. Claims 136-143
have been added and claims 1 and 45 have been amended. Neither the new claims nor the claim 
amendments constitute addition of new matter to the instant application. 

Support for each of the new claims and claim amendments can be found in the claims as 
originally filed and throughout the specification. In particular, support for the amendment to 
claim 1 can be found on page 40, line 21 to page 45, line 2; page 98, line 6 to page 103, line 22 
(Example 7); the claims as originally filed and elsewhere throughout the specification. Support 
for new claims 136-143 can be found on page 44, lines 1-13; page 45, lines 22-28; page 46, line 
28 to page 47, line 21; page 51, line 23 to page 52, line 13; page 56, line 14 to page 57, line 10; 
page 76, line 1 to page 78, line 17; page 103, line 24 to page 109, line 15 (Example 8) and 
elsewhere throughout the specification. Accordingly, no new matter has been added to the 
application. 

Applicants have reviewed the rejections of claims 1-11, 13-22 and 24-47 as set out in the 
instant Office Action. After careful consideration, Applicants respectfully traverse these 
rejections. 

Information Disclosure Statement 

Applicants would like to draw the Examiner's attention to the Information Disclosure 
Statement submitted herewith, which includes an Office Action from copending U.S. Patent 
Application Number 09/948,993. 

Objection to the Specification 

The specification of the instant application is objected to because it contains an embedded 
hyperlink at page 54. The Examiner has requested that Applicants delete all browser executable 
code contained within the application. 

Applicants have searched the specification and found browser executable code occurring 
on pages 52-54 and 119. In some instances, Applicants have deleted the text containing the 
browser executable code. In other instances, Applicants have replaced the text containing the 
browser executable code with text that teaches the public how to access the intended website by 
using a web browser. Accordingly, the amended specification no longer contains browser
executable code. 

In view of the above amendments, Applicants respectfully request the Examiner to 
withdraw the objection to the specification. 

Other Amendment to the Specification 

Applicants have amended the specification to correct a typographical error at page 50, 
line 10. In particular, Granger et al., Published PCT Application No. WO 98/01579, was
inadvertently cited as WO 98/01879. Applicants have amended the paragraph beginning at page 
49, line 19 and ending at page 50, line 13 to correct this error. 

Rejection of Claims 1-11, 25, 28-40 and 42-47 Under 35 U.S.C. § 112, Second Paragraph

The Examiner rejects claims 1-11, 25, 28-40 and 42-47 under 35 U.S.C. § 112, second 
paragraph as failing to particularly point out and distinctly claim subject matter that is regarded 
as an invention. In particular, the Examiner asserts that the phrase "said fusion promoter 
comprising at least one promoter that is modified to have altered activity in at least one gram- 
positive organism" is allegedly vague and indefinite because the claim allegedly neither sets forth 
how the promoter is "modified" nor the type of "altered activity" that results. In addition, with 
respect to Claim 45, the Examiner alleges that a word is missing prior to the word "microbe." 

Applicants maintain that Claims 1-11, 25, 28-40 and 42-47 are not indefinite; however, 
solely to expedite the allowance of these claims, Applicants have replaced the phrase "that is 
modified to have altered activity" in original claim 1 (and by dependence, claims 2-11, 25, 28-40 
and 42-47), with the phrase "comprising at least one nucleotide sequence modification which 
alters the transcriptional activity of said promoter." Applicants respectfully submit that this 
amendment addresses the Examiner's concerns regarding the use of the term "modified" and the 
phrase "altered activity." 

In addition to the foregoing amendment, Applicants have corrected the typographical 
error in claim 45 by adding the word "a" just prior to the word "microbe." This amendment is in 
accordance with the Examiner's reading of the claim as stated on page 5 of the instant Office 
Action. 






Appl. No. : 10/032,393 

Filed : December 21, 2001 

In view of the amendments to claims 1 and 45, Applicants request that the Examiner 
withdraw the rejection of claims 1-11, 25, 28-40 and 42-47 under 35 U.S.C. § 112, second 
paragraph. 

Rejection of Claim 28 Under 35 U.S.C. § 112, First Paragraph 

The Examiner rejects claim 28 under 35 U.S.C. § 112, first paragraph as containing 
subject matter that was not described in the specification in such a way as to enable a skilled 
artisan to make and/or use the claimed subject matter. In particular, the Examiner asserts that "it 
is not clear that all of the plasmids [named in claim 28] are readily available to the public." 

Applicants respectfully submit that each of the plasmids named in claim 28 is known
and readily available to the public. Applicants note that claim 28 recites a vector "comprising at least
one replicon selected from the group consisting of p15a, pC194 and pCT1138." Thus, the
vectors set forth in claim 28 comprise an origin of replication that is equivalent to the origin of
replication from one or more of the plasmids p15a, pC194 or pCT1138. Each of the named
plasmids, p15a, pC194 and pCT1138, was known and available to the public at the time of filing
the instant application. Moreover, each of these plasmids remains known and
publicly available.

Examples of acceptable relevant evidence that can be used to demonstrate that a
biological material is known and available to the public are listed in the M.P.E.P. In particular,
M.P.E.P. § 2404.01 states that evidence relevant to demonstrate that a biological material is
known and readily available to the public includes a showing of commercial availability,
reference to the biological materials in printed publications, declarations of accessibility by those
working in the field, evidence of predictable isolation techniques, or an existing deposit made in
accordance with the rules set out in 37 C.F.R. §§ 1.801-1.809.

To demonstrate that each of the vectors named in claim 28 is known and available to the
public, Applicants provide herewith Exhibits A-F. Exhibit A shows that plasmid p15a was first
isolated from E. coli in 1968. A reliable method for isolating this plasmid is described by
Cozzarelli, N. R. et al. in PNAS 60:992-999. Since its first isolation in 1968, p15a has been
widely distributed among scientists and its replicon has been used in the construction of a variety
of plasmid vectors. Many such vectors are commercially available. For example, the p15a
replicon is contained in the plasmids pLysS and pLysE, both of which are available for purchase
from Novagen (see Exhibit B). In view of the foregoing evidence, which demonstrates the
widespread commercial availability of p15a, its extensive use and a predictable method for its
isolation, Applicants respectfully submit that p15a is known and available to the public.

Similar to p15a, the plasmid pC194 is known and available to the public. In particular,
pC194 is a plasmid that was first isolated from Staphylococcus aureus and which has been in use
since at least 1978 (see Exhibit C; see also Iordanescu et al., Plasmid 1:468-479). Exhibit D
shows that pC194 is publicly available from the depository Deutsche Sammlung von 
Mikroorganismen und Zellkulturen GmbH (hereinafter DSMZ), Braunschweig, Germany. 
DSMZ is an international depository authority recognized under the Budapest Treaty (see 
M.P.E.P. § 2405). In view of the foregoing evidence, Applicants respectfully submit that pC194 
is known and available to the public. 

The final plasmid named in claim 28, pCT1138, is also known and available to the 
public. The plasmid pCT1138, which is also known as the citrate plasmid, was originally 
isolated from Lactococcus lactis subsp. lactis (see Exhibit E; see also Pederson et al., Mol. Gen.
Genet. 244:374-382). Exhibit F shows that pCT1138 is publicly available from the international
depository authority, DSMZ. In view of the foregoing evidence, Applicants respectfully submit
that pCT1138 is known and available to the public.

In view of the above arguments, which demonstrate that each of the plasmids named in
claim 28 is known and available to the public, Applicants request that the Examiner withdraw
the rejection of this claim under 35 U.S.C. § 112, first paragraph.

Rejection of Claims 1, 7, 9 and 10 Under 35 U.S.C. § 102(b)

The Examiner rejects claims 1, 7, 9 and 10 under 35 U.S.C. § 102(b) as being anticipated 
by published International Patent Application No. WO99/28508 (Marra et al.). In particular, the 
Examiner asserts that Marra et al. disclose an isolated nucleic acid comprising a fusion promoter 
having a promoter that is modified to have an altered activity in a gram-positive organism, 
wherein the promoter is linked to tetO. Furthermore, the Examiner asserts that Marra et al. 
disclose that the binding of a repressor to tetO represses transcription. 

Applicants maintain that Marra et al. does not anticipate claim 1, which is independent, 
nor does it anticipate claims 7, 9 or 10, which are dependent on claim 1. In particular, Applicants 
point out that independent claim 1 recites, in relevant part, "an isolated nucleic acid comprising a 
fusion promoter said fusion promoter comprising at least one promoter comprising at least one
nucleotide sequence modification which alters the transcriptional activity of said promoter in at
least one gram-positive organism said promoter being linked to at least one operator selected
from the group consisting of xylO, tetO, trpO, malO and λcIO . . . " The Examiner asserts that
Marra et al. disclose fusion promoters comprising modified promoters linked to an operator 
sequence. After having carefully reviewed Marra et al., Applicants maintain that Marra et al. do 
not disclose a promoter which comprises a nucleotide sequence modification that alters the 
transcriptional activity of the promoter in a gram-positive organism. As such, Marra et al. does 
not disclose the claimed fusion promoters. 

In view of the foregoing remarks, Applicants respectfully request that the Examiner 
withdraw the rejection of claims 1, 7, 9 and 10 under 35 U.S.C. § 102(b). 

Rejection of Claims 1-4, 7-11, 13-15, 18-22, 24-27, 29, 32, 33, 36-40 and 42-47 Under 35 U.S.C.
§ 103(a)

The Examiner rejects claims 1-4, 7-11, 13-15, 18-22, 24-27, 29, 32, 33, 36-40 and 42-47 
under 35 U.S.C. § 103(a) as being obvious over published European Patent Application No. 
EP 0186069 (Bujard et al.) in view of Sizemore et al. (J. Bacteriol. 174:3042-3048). In
particular, the Examiner asserts that Bujard et al. disclose an isolated nucleic acid comprising a 
fusion promoter comprising the T5 promoter operatively linked to a lac operator and that 
Sizemore et al. allegedly disclose a xylose operon control region from Staphylococcus xylosus, 
which includes the xyl operator. The Examiner also asserts that it would have been obvious for a 
skilled artisan to combine the disclosures of Bujard et al. and Sizemore et al. to achieve the subject
matter of the above-mentioned claims because Bujard et al. suggest that the T5 promoter can be 
combined with any operator. The Examiner then asserts that a skilled artisan would have been 
motivated to combine these disclosures in view of the known equivalence of the xyl and lac 
operators and the known usefulness of the xylO operator. 

Applicants maintain that claims 1-4, 7-11, 13-15, 18-22, 24-27, 29, 32, 33, 36-40 and 42- 
47 are not obvious because the combination of Bujard et al. and Sizemore et al. does not teach 
every element of these claims. Furthermore, Applicants maintain that these claims are not
obvious because a skilled artisan would neither be motivated to combine the disclosure of
Bujard et al. with Sizemore et al. nor reasonably expect a promoter from a
gram-negative organism fused to an operator from a gram-positive organism to successfully
promote regulatable transcription in an organism. Each of these arguments is set out in detail
below.

The above-rejected claims are not obvious because the combination of Bujard et al. and 
Sizemore et al. does not teach every element of these claims. For example, Claim 1 recites, in 
relevant part, a fusion promoter comprising "at least one nucleotide sequence modification which 
alters the transcriptional activity of said promoter in at least one gram-positive organism." 
Neither Bujard et al. nor Sizemore et al. discloses a promoter that comprises at least one
nucleotide sequence modification which alters the transcriptional activity of the promoter in a 
gram-positive organism. As such, the combination of Bujard and Sizemore does not disclose 
every element of the rejected claims. 

In addition to the foregoing, Applicants submit that claims 1-4, 7-11, 13-15, 18-22, 24- 
27, 29, 32, 33, 36-40 and 42-47 are not obvious because a skilled artisan would neither be 
motivated to combine the disclosure of Bujard et al. with Sizemore et al. nor would a skilled 
artisan reasonably expect a promoter from a gram-negative organism fused to an operator from a 
gram-positive organism to promote regulatable transcription in an organism. In particular, a 
fusion promoter construct that is based on the combination of Bujard et al. and Sizemore et al. 
would allegedly be a fusion promoter having a T5 promoter operably linked to a xyl operator. 
Applicants note that the T5 promoter is a coliphage promoter. Coliphage promoters are 
functional in E. coli and certain other gram-negative organisms. The xyl operator is an operator 
that is obtained from the xylose metabolism operon from S. xylosus, which is a gram-positive 
organism. E. coli does not have a repressor protein that binds the xyl operator. Furthermore, it is
known in the art that promoters obtained from gram-negative organisms generally do
not function in gram-positive organisms (see Jarmer et al. (2001) Microbiology 147:2417-2424,
page 1, column 2 - a copy of this reference is enclosed herewith for the Examiner's convenience 
as Exhibit G). Based on the above facts, a skilled artisan would not reasonably expect a T5 
promoter fused to a xyl operator to function as a regulatable promoter in either a gram-negative 
or a gram-positive organism since it would not be expected to be regulatable in the gram- 
negative organism and it would not be expected to be transcriptionally active in the gram- 
positive organism. As such, one of ordinary skill in the art would not be motivated to combine a 
T5 promoter with a xyl operator from a gram-positive organism. 

In view of the foregoing remarks, Applicants respectfully request that the Examiner
withdraw the rejection of claims 1-4, 7-11, 13-15, 18-22, 24-27, 29, 32, 33, 36-40 and 42-47 
under 35 U.S.C. § 103(a). 

Rejection of Claims 1-4, 7-11, 13-15, 18-22, 24-29, 32, 33, 36-40 and 42-47 Under 35 U.S.C.
§ 103(a)

The Examiner rejects claims 1-4, 7-11, 13-15, 18-22, 24-29, 32, 33, 36-40 and 42-47 
under 35 U.S.C. § 103(a) as being obvious over Bujard et al. in view of Sizemore, et al. and in 
further view of U.S. Patent No. 4,959,311 (Shih et al.), U.S. Patent No. 4,656,136 (Kisumi et al.) 
or Pederson et al. (Mol. Gen. Genet. 244:374-382). In particular, the Examiner reiterates the 
rejection based on the combination of Bujard et al. and Sizemore et al. and then further asserts 
that Shih et al., Kisumi et al. and Pederson et al. allegedly "disclose, respectively, the p15a,
pC194 and pCT1138 replicons and their usefulness in cloning and expression plasmids." The 
Examiner then asserts that a skilled artisan would have been motivated to combine the above 
references in order "to utilize such known replicons in order to take advantage of the known 
replication properties in the microorganism of interest." 

Applicants maintain that claims 1-4, 7-11, 13-15, 18-22, 24-29, 32, 33, 36-40 and 42-47 
are not obvious since the combination of Bujard et al. and Sizemore et al. fails to teach every 
element of the above-rejected claims and because the disclosures of Shih et al., Kisumi et al. and
Pederson et al. do not provide the missing element. In particular, Shih et al., Kisumi et al. and
Pederson et al. do not disclose "at least one nucleotide sequence modification which alters the 
transcriptional activity of said promoter in at least one gram-positive organism." 

In addition to the foregoing, Applicants submit that claims 1-4, 7-11, 13-15, 18-22, 24- 
29, 32, 33, 36-40 and 42-47 are not obvious because, based on the above-cited combinations of 
references, a skilled artisan would neither be motivated to construct the claimed 
promoter/operator fusions nor would they expect such fusions to regulate transcription in an 
organism. In particular, the references Shih et al., Kisumi et al. and Pederson et al. do not teach 
or suggest that a combination of a T5 promoter with a xyl operator would be a construct that 
would have a regulatable transcriptional activity in an organism. In fact, the general knowledge 
in the art, as exemplified by Jarmer et al. (Exhibit G), suggests that such constructs would not 
possess regulatable transcriptional activity in either gram positive or gram negative organisms. 
As such, the combination of the above-cited references does not render the subject matter of
claims 1-4, 7-11, 13-15, 18-22, 24-29, 32, 33, 36-40 and 42-47 obvious. 

In view of the foregoing remarks, Applicants respectfully request that the Examiner 
withdraw the rejection of claims 1-4, 7-11, 13-15, 18-22, 24-29, 32, 33, 36-40 and 42-47 under 
35 U.S.C. § 103(a).

Rejection of Claims 1-11, 13-27, 29, 32-40 and 42-47 Under 35 U.S.C. § 103(a)

The Examiner rejects claims 1-11, 13-27, 29, 32-40 and 42-47 under 35 U.S.C. § 103(a) 
as being obvious over Bujard et al. in view of Sizemore et al. and in further view of U.S. Patent
No. 5,362,646 (the '646 patent). In particular, the Examiner reiterates the rejection based on the 
combination of Bujard et al. and Sizemore et al. and then further asserts that the '646 patent 
allegedly discloses nucleic acids which comprise a coliphage T promoter linked to lacO and 
further discloses that such fusion constructs can contain two operators. The Examiner then 
asserts that one of ordinary skill in the art would have been motivated to combine the above 
references so as to implement two-operator embodiments of the claimed fusion promoters in order
to obtain additional expression control. 

Applicants maintain that claims 1-11, 13-27, 29, 32-40 and 42-47 are not obvious since 
the combination of Bujard et al. and Sizemore et al. fails to teach every element of the above- 
rejected claims and because the disclosure of the '646 patent does not provide the missing 
element. In particular, the '646 patent does not disclose "at least one nucleotide sequence 
modification which alters the transcriptional activity of said promoter in at least one gram- 
positive organism." 

In addition to the foregoing, Applicants submit that claims 1-11, 13-27, 29, 32-40 and 42- 
47 are not obvious because, based on the above-cited combinations of references, a skilled 
artisan would neither be motivated to construct the claimed promoter/operator fusions nor would 
they expect such fusions to regulate transcription in an organism. In particular, the '646 patent 
does not teach or suggest that a combination of a T5 promoter with a xyl operator would be a 
construct that would have a regulatable transcriptional activity in an organism. In fact, the 
general knowledge in the art, as exemplified by Jarmer et al. (Exhibit G), suggests that such 
constructs would not possess regulatable transcriptional activity in either gram positive or gram 
negative organisms. As such, the combination of the above-cited references does not render the
subject matter of claims 1-11, 13-27, 29, 32-40 and 42-47 obvious.

In view of the foregoing remarks, Applicants respectfully request that the Examiner 
withdraw the rejection of claims 1-11, 13-27, 29, 32-40 and 42-47 under 35 U.S.C. § 103(a). 

Rejection of Claims 1-4, 7-11, 13-15, 18-22, 24-27, 29-33, 36-40 and 42-47 Under 35 U.S.C.
§ 103(a)

The Examiner rejects claims 1-4, 7-11, 13-15, 18-22, 24-27, 29-33, 36-40 and 42-47 
under 35 U.S.C. § 103(a) as being obvious over Bujard et al. in view of Sizemore et al. and in
further view of Israelson et al. (Appl. Environ. Microbiol. 61:2540-2547). In particular, the
Examiner reiterates the rejection based on the combination of Bujard et al. and Sizemore et al. 
and then further asserts that Israelson et al. allegedly disclose the lacL-lacM reporter gene from 
Leuconostoc mesenteroides. The Examiner then asserts that one of ordinary skill in the art would 
have been motivated to combine the above references so as to obtain the claimed fusion
promoter/reporter gene constructs "in order to obtain information regarding the level of 
expression of the promoter region." 

Applicants maintain that claims 1-4, 7-11, 13-15, 18-22, 24-27, 29-33, 36-40 and 42-47 
are not obvious since the combination of Bujard et al. and Sizemore et al. fails to teach every 
element of the above-rejected claims and because the disclosure of Israelson et al. does not 
provide the missing element. In particular, Israelson et al. do not disclose "at least one 
nucleotide sequence modification which alters the transcriptional activity of said promoter in at 
least one gram-positive organism." 

In addition to the foregoing, Applicants submit that claims 1-4, 7-11, 13-15, 18-22, 24- 
27, 29-33, 36-40 and 42-47 are not obvious because, based on the above-cited combinations of 
references, a skilled artisan would neither be motivated to construct the claimed 
promoter/operator fusions nor would they expect such fusions to regulate transcription in an 
organism. In particular, the reference Israelson et al. does not teach or suggest that a combination 
of a T5 promoter with a xyl operator would be a construct that would have a regulatable 
transcriptional activity in an organism. In fact, the general knowledge in the art, as exemplified 
by Jarmer et al. (Exhibit G), suggests that such constructs would not possess regulatable 
transcriptional activity in either gram positive or gram negative organisms. As such, the 
combination of the above-cited references does not render the subject matter of claims 1-4, 7-11,
13-15, 18-22, 24-27, 29-33, 36-40 and 42-47 obvious. 

In view of the foregoing remarks, Applicants respectfully request that the Examiner 
withdraw the rejection of claims 1-4, 7-11, 13-15, 18-22, 24-27, 29-33, 36-40 and 42-47 under 
35 U.S.C. § 103(a).



CONCLUSION

Applicants believe that all outstanding issues in this case have been resolved and that the
present claims are in condition for allowance. Nevertheless, if any undeveloped issues remain or
if any issues require clarification, the Examiner is invited to contact the undersigned at the
telephone number provided below in order to expedite the resolution of such issues.

Please charge any additional fees, including any fees for additional extension of time, or
credit overpayment to Deposit Account No. 11-1410.



Respectfully submitted, 



KNOBBE, MARTENS, OLSON & BEAR, LLP 





Registration No. 53,009 
Attorney of Record 
Customer No. 20,995 
(619) 235-8550 

Exhibit B

pLysS & pLysE

TB107 12/98

pLysS (Cat. No. 69659-3) and pLysE (Cat. No. 69658-3) are 4886 bp plasmids constructed by
insertion of the T7 lysozyme gene into the BamH I site of pACYC184 (1,2). These plasmids are not
cloning vectors; they are used in λDE3 lysogenic hosts to suppress basal expression from the T7
promoter by producing T7 lysozyme, a natural inhibitor of T7 RNA polymerase. The two plasmids
differ only by the orientation of the T7 lysozyme gene. In pLysS the T7 lysozyme coding sequence
is in the antisense orientation relative to the tet promoter, so only a small amount of T7 lysozyme
is produced. In pLysE large amounts of T7 lysozyme are produced from the tet promoter. The
construct also contains the weak T7 φ3.8 promoter immediately following the lysozyme gene. The
p15A origin of replication is compatible with those found in pBR322- and pUC-derived plasmids.
Unique sites are shown on the circle map.



1. Studier, F.W. (1991) J. Mol. Biol. 219, 37-44.

2. Chang, A.C.Y. and Cohen, S.N. (1978) J. Bacteriol. 134, 1141.



pLysS & pLysE sequence landmarks 



Cm gene coding seq.               4449-219
p15A origin                       581-1493
T7 lysozyme coding seq. (pLysS)   2015-2467
T7 lysozyme coding seq. (pLysE)   1918-2370

A hidden Markov model of σA RNA polymerase cofactor recognition sites in
Bacillus subtilis, containing either the common or the extended -10 motifs, has
been constructed based on experimentally verified σA recognition sites. This
work suggests that more information exists at the initiation site of
transcription in both types of promoters than previously thought. When tested
on the entire B. subtilis genome, the model predicts that approximately half of
the σA recognition sites are of the extended type. Some of the
response-regulator aspartate phosphatases were among the predictions of promoters
containing extended sites. The expression of rapA and rapB was confirmed by
site-directed mutagenesis to depend on the extended -10 region.



Keywords: sigma factor, HMM, response regulator aspartate phosphatase, extended
-10 region



INTRODUCTION 

To initiate transcription, RNA polymerase (RNAP) has
to recognize and bind to the promoter region. In
prokaryotic cells this ability resides in the 'specificity'
(in Greek s = σ) factor of the RNAP complex. The
genome of Bacillus subtilis encodes at least 17 different
σ factors (Huang & Helmann, 1998). Growing cells
utilize at least six different σ factors: the housekeeping
σA, and σB, σC, σD, σH and σL, and B. subtilis uses yet
another four during endospore formation: σE, σF, σG
and σK. The remaining seven σ factors were identified
after sequencing of the complete genome and are all of
the extracytoplasmic function (ECF) subfamily (Huang
et al., 1998).

The σ factor in the RNAP complex recognizes and binds
a specific conserved DNA pattern upstream of the
transcription start site, thereby allowing the RNAP to
associate with the DNA strand, first loosely in a 'closed
promoter-polymerase complex', and then tightly,
melting a local region of the promoter to form an 'open
promoter-polymerase complex', resulting in initiation
of transcription. When the transcription is initiated, the
σ factor is released from the complex.

Abbreviations: FP, false positive; HMM, hidden Markov model; RNAP,
RNA polymerase; TP, true positive.

Every σ factor facilitates binding of the RNAP complex
by recognition of a specific binding site, usually located
10 and 35 bp upstream of the transcription start site. σA
in B. subtilis participates in the initiation of transcription
of most of the housekeeping genes. The consensus
sequence recognized by σA, 5'-TTGACA-17 nt-TATAAT-3',
is identical to the consensus that σ70 of
Escherichia coli recognizes. σA-dependent promoters
from B. subtilis are easily transcribed by the σ70 of E.
coli, but poorly the other way around (Camacho &
Salas, 1999), which suggests that the RNAP of B. subtilis
has a stricter requirement for binding than the RNAP of
E. coli (Voskuil & Chambliss, 1998; Camacho & Salas,
1999). This corresponds with the fact that earlier studies
have shown that many Gram-positives including B.
subtilis utilize an extended -10 region in a large number
of their σA-dependent promoters. This region is located
1 bp upstream of the -10 region and is hence referred to
as the -16 region. The consensus of this region is
5'-TRTG-3', where R = G/A (Helmann, 1995; Voskuil &
Chambliss, 1998; Camacho & Salas, 1999), and it is
therefore larger than the corresponding 5'-TG-3' motif
found in E. coli (Ponnambalam et al., 1986; Keilty &
Rosenberg, 1987). This extension is estimated to exist in
less than 10% of the promoters in E. coli (Chan et al.,
1990) and approximately 45% in B. subtilis. Especially
in promoters containing this extended signal, a series of
A- and T-rich regions upstream of the -35 region has
been observed. In addition, both σA-dependent promoter types
have an overrepresentation of A residues downstream
of the -10 region (Voskuil & Chambliss, 1998). By
extracting this information it is possible to create a
model for prediction of new sites.
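
The consensus features described above can be illustrated with a naive scanner. The following Python sketch is not the hidden Markov model built in this work; it only checks a fixed-spacing match to 5'-TTGACA-17 nt-TATAAT-3' and then looks for the TRTG extension 1 bp upstream of the -10 hexamer. The function names, the mismatch threshold and the rigid 17 nt spacer are illustrative assumptions.

```python
MINUS35 = "TTGACA"   # -35 consensus hexamer
MINUS10 = "TATAAT"   # -10 consensus hexamer
SPACER = 17          # nt between the two hexamers (assumed fixed here)


def mismatches(site, consensus):
    """Count positions at which site differs from consensus."""
    return sum(1 for s, c in zip(site, consensus) if s != c)


def find_sites(seq, max_mismatch=2):
    """Return (start, n_mismatch, extended) for each candidate site.

    start is the 0-based index of the -35 hexamer; extended is True
    when TRTG (R = A or G) ends 1 bp upstream of the -10 hexamer.
    """
    seq = seq.upper()
    total = len(MINUS35) + SPACER + len(MINUS10)
    hits = []
    for i in range(len(seq) - total + 1):
        m10_start = i + len(MINUS35) + SPACER
        n = (mismatches(seq[i:i + 6], MINUS35)
             + mismatches(seq[m10_start:m10_start + 6], MINUS10))
        if n <= max_mismatch:
            ext = seq[m10_start - 5:m10_start - 1]  # 4 nt, then a 1 nt gap
            extended = ext[0] == "T" and ext[1] in "AG" and ext[2:] == "TG"
            hits.append((i, n, extended))
    return hits
```

A consensus scan of this kind ignores position-specific frequencies and flanking signals, which is the kind of information a trained probabilistic model is meant to capture.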

The complete genome sequence (4.2 Mb) of B. subtilis
was published in November 1997 (Kunst et al., 1997),
and at present 4228 genes are annotated (SubtiList, 
1999). Approximately one-third of these genes have 
experimentally identified functions. The function of the 
second third can be predicted by homology to other 
known gene products. Among the last third of the genes 
there is most likely an unknown number of misclassified 
ORFs, and therefore the exact number of genes in B. 
subtilis remains unknown. It is therefore also not 
possible to estimate how many promoters exist in the 
genome of B. subtilis. If the annotated genes are correct 
and if only regions upstream of a gene and downstream 
of a terminator, or regions between genes arranged 
head-to-head, are defined as promoter regions, a
conservative estimate is that B. subtilis has 1800
promoter regions. The fraction of these that is dependent
on σA is not known.

There are currently no publicly accessible tools for the
prediction of σA-binding sites in B. subtilis. Nobody has
attempted to estimate the number of such sites. We have
used hidden Markov models (HMMs) and trained them
to recognize σA-binding sites in B. subtilis from existing
experimentally generated data. The goal of this work is
to create a tool to predict the number of true signals
within the genome. This work has the further aim of
recovering possible hidden information in the surrounding
sequence of the two types of σA-binding sites as they
are known today. This will clarify the differences
between the sites, and make it easier to distinguish
between them.

METHODS 

Hidden Markov models. The central idea of an HMM is to 
embed the statistics of a motif in a set of states with transitions 
between them. Each HMM state has a specific probability 
distribution over the four nucleotides and hence one may say
that it 'emits' nucleotides according to specific emission
probabilities. There is a state for each position in the motif
and the emission probabilities essentially end up being equal
to the nucleotide frequencies at these positions. Hence, an
HMM may be viewed either as a generative model which
'emits' nucleotides according to specific statistics or as a
scoring model which may be used to answer questions such as:
'To what extent is a given sequence compatible with/similar
to the sequences used to train the HMM?'. These two HMM
interpretations are equally valid and the choice between them
depends on the application in question (for further introduction
to HMMs we recommend Durbin et al., 1998).
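The 'scoring model' view can be illustrated with a minimal sketch. It assumes a toy set of aligned six-base sites (hypothetical, not taken from the training data), a pseudocount of 1 and a uniform background. It estimates one emission distribution per motif position from nucleotide frequencies and scores a query by its log-odds against the background. It omits transitions and length modelling, so it is a simplification rather than the HMM described here.

```python
import math

ALPHABET = "ACGT"
# Hypothetical toy training sites, for illustration only.
TRAINING_SITES = ["TATAAT", "TATACT", "TAAAAT"]

def emission_probabilities(sites, pseudocount=1.0):
    """Estimate one emission distribution per motif position."""
    probs = []
    for pos in range(len(sites[0])):
        counts = {nt: pseudocount for nt in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        probs.append({nt: counts[nt] / total for nt in ALPHABET})
    return probs

def log_odds(seq, probs, background=0.25):
    """Score a sequence against a uniform background (in bits)."""
    return sum(math.log2(probs[i][nt] / background) for i, nt in enumerate(seq))

PROBS = emission_probabilities(TRAINING_SITES)
```

A sequence resembling the training sites, such as `log_odds("TATAAT", PROBS)`, scores higher than an unrelated one such as `log_odds("GCGCGC", PROBS)`.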

HMMs are generally well suited for searching for motifs like
σA-binding sites since they facilitate an easy and intuitive
incorporation of prior knowledge about signals associated 
with the motif in question. Another advantage, compared to 
techniques such as neural networks, is the ease of relating 
trained model parameters to sequence information; for 
instance, it is possible to use the trained emission probabilities 
to directly read off any consensus signals found and to get a 
good idea of the information present in these signals. 



Prior knowledge may be included in the HMM architecture by 
addition or deletion of states, by biasing their nucleotide 
emission probabilities and/or biasing the probabilities of 
transitions between them. 

When a model architecture has been set up, the optimal 
parameters are estimated by the Baum-Welch algorithm, 
which maximizes the likelihood of the training sequences 
given the model - i.e. it finds the HMM parameters which best 
capture the statistics of the training sequences. 

The trained model is then used to analyse sequences not 
included in the training set. To get an idea of the model's 
ability to generalize, one may split the initial training set into 
10 parts and then repeatedly train on nine parts and test on the 
remaining part, until all parts have been tested once. This is a 
common technique known as a 10-fold cross-validation. It 
provides a way of estimating the extent of expected false 
positives and false negatives for a given threshold, when using 
the model to decode new sequences. 
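The splitting step of a 10-fold cross-validation can be sketched as follows. The function name, the shuffling seed and the use of integer indices in place of sequences are assumptions made for illustration. Training and scoring of the model itself are left out.

```python
import random

def ten_fold_splits(items, n_folds=10, seed=0):
    """Yield (train, test) pairs so that every item is tested exactly once."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Example: indices standing in for a set of 130 training sequences.
splits = list(ten_fold_splits(range(130)))
```

With 130 items, each of the 10 rounds trains on 117 items and tests on the remaining 13.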

'Decoding' is the term applied to the process of evaluating
how well a sequence or sub-sequence fits a given HMM model.
There are several ways to perform decoding, and we have used
posterior decoding, where one calculates, for the ith nucleotide
$x_i$ in the query sequence $x$ of length $L$, the total probability that
the state $\pi_i$ emitting it is state $k$, $P(\pi_i = k \mid x)$. Note that in
general there are many paths through the model that could
have emitted nucleotide $x_i$ while in state $k$ (i.e. for which
$\pi_i = k$), so one must add the probabilities of all these parses to get
the total probability. Formally, we have:

$$P(\pi_i = k \mid x) = \frac{P(x, \pi_i = k)}{P(x)} \qquad (1)$$

The numerator may be written

$$P(x, \pi_i = k) = P(x_1 \ldots x_i, \pi_i = k)\, P(x_{i+1} \ldots x_L \mid x_1 \ldots x_i, \pi_i = k)$$

$$= P(x_1 \ldots x_i, \pi_i = k)\, P(x_{i+1} \ldots x_L \mid \pi_i = k) \qquad (2)$$

since all observations after $x_i$ depend only on $\pi_i$. The first and
second term in this product may be calculated recursively by
the forward and backward algorithm respectively (Durbin et
al., 1998). The remaining unknown on the right-hand side of
equation (1), $P(x)$, may also be obtained from the forward/
backward algorithms.
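Posterior decoding can be sketched for a small HMM as follows. The two states and all probabilities below are hypothetical and chosen only for illustration. The sketch has no explicit end state and works with raw probabilities, so long sequences would need scaling or log-space arithmetic to avoid underflow.

```python
def posterior_decoding(seq, states, start, trans, emit):
    """Return, for each position i, a dict {state k: P(pi_i = k | seq)}."""
    n = len(seq)
    # Forward: f[i][k] = P(x_1..x_i, pi_i = k)
    f = [{k: start[k] * emit[k][seq[0]] for k in states}]
    for i in range(1, n):
        f.append({k: emit[k][seq[i]] * sum(f[i - 1][j] * trans[j][k] for j in states)
                  for k in states})
    # Backward: b[i][k] = P(x_{i+1}..x_L | pi_i = k)
    b = [None] * n
    b[n - 1] = {k: 1.0 for k in states}
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(trans[k][j] * emit[j][seq[i + 1]] * b[i + 1][j] for j in states)
                for k in states}
    p_x = sum(f[n - 1][k] for k in states)  # P(x) from the forward pass
    return [{k: f[i][k] * b[i][k] / p_x for k in states} for i in range(n)]

# Hypothetical two-state toy model: uniform background and an A/T-rich 'site'.
STATES = ("bg", "site")
START = {"bg": 0.9, "site": 0.1}
TRANS = {"bg": {"bg": 0.9, "site": 0.1}, "site": {"bg": 0.5, "site": 0.5}}
EMIT = {"bg": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "site": {"A": 0.4, "C": 0.05, "G": 0.05, "T": 0.5}}

post = posterior_decoding("GCTATAATGC", STATES, START, TRANS, EMIT)
```

At every position the posteriors over states sum to one, which is a convenient sanity check on an implementation.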

HMM prediction. For predicting sites in the genome we
calculate, for each nucleotide, the posterior probability that it
was emitted by the first state of the -10 region in a σA-binding
motif. Once we have the posterior probabilities of the -10
start-states at all nucleotide positions, we simply regard all
probabilities above a certain threshold (determined by the
cross-validation procedure described above) as statistically
significant. Hence, whenever the posterior probability of the
desired motif exceeds the threshold, the model is said to have
'found' a motif at that nucleotide. The better the motif and its
contextual signals fit the model, the higher the probability
score, and the more confidence will be placed in the prediction.
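The thresholding step can be sketched as below. The posterior values and the threshold of 0.8 are made-up numbers for illustration; the threshold actually used is the one chosen by cross-validation.

```python
def call_sites(start_state_posteriors, threshold):
    """Return (position, posterior) pairs whose posterior exceeds the threshold."""
    return [(pos, p) for pos, p in enumerate(start_state_posteriors) if p > threshold]

# Hypothetical posteriors of the -10 start state along a short stretch.
posteriors = [0.01, 0.02, 0.91, 0.05, 0.40, 0.88]
hits = call_sites(posteriors, threshold=0.8)
```

Here `hits` contains positions 2 and 5, the only two values above the threshold.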

Fig. 1 is a schematic view of the HMM used to predict σA
promoters in B. subtilis. The model incorporates known
information about conserved positions in σA binding
(Helmann, 1995; Voskuil & Chambliss, 1998), and was
trained to pick up additional unknown signals.

'background' is a state whose emission probabilities are 
obtained from a first-order HMM trained on the entire B. 
subtilis genome (direct strand). This represents a null model. 






Sigma A recognition sites 




[Fig. 1 diagram. Box labels: B, BACKGROUND, Extended model, Normal model, A-T background, E.]





Fig. 1. A schematic drawing of the HMM used to predict σA-
binding sites in B. subtilis. Each box indicates a submodel (see
text). The arrows show the possible transitions between sub-
models. A circular arrow indicates that the model is allowed to
loop (stay in the same state for more than one base in the
sequence) at the given position (Durbin et al., 1998).



The reason for using a low-order Markov chain for the 
background is to avoid modelling the genome too explicitly, 
since most of the genome is presumably coding whereas the 
promoters generally reside in non-coding regions. If the 
promoter finder were combined with a gene finder, one could 
train the background state on supposed non-coding regions 
and conceivably improve the signal-to-noise ratio. However, 
it is unclear how the promoter-finding performance would be 
affected in coding regions. This is certainly a possible path of 
further investigation. 

The model shown in Fig. 1 is used exclusively for decoding
(testing). For training purposes a loop model should be
avoided, since in the absence of fully labelled training
sequences it may end up using several motifs in its maximum-
likelihood estimation even though only one is actually
present; this will then distort the statistics of the motif states
and hence impair decoding performance. Thus, during training
a second background (identical to 'background') is
included on the right-hand side of Fig. 1 in such a way that all
three signal states must pass on to this second background and
from here to the end state 'E'.

Promoter regions as well as other intergenic regions are
known to be comparatively A and T rich. Hence, in order to
prevent prediction of σA signals merely on the basis of A and
T richness, a state has been added to the model ('A-T
background' in Fig. 1). The 'signal' state is included purely
for technical reasons in order to switch from the trained
left-right model to the looping prediction model shown in
Fig. 1.

Note the presence of two alternative σA-binding site models in
Fig. 1. This is motivated by the finding of two different
submodels of binding sites - one with an extended -10 region
(extended) and one without (normal) (Helmann, 1995;
Voskuil & Chambliss, 1998). As dictated by the data in the
training set each model allows a separation of the -10 and
-35 region of 16-21 bp and 4-10 bp between the -10 region
and the start site of transcription (Helmann, 1995).

Fig. 2 shows a more detailed view of the extended and normal 
submodels. The consensus sequences of the —10 and —35 
regions are clearly marked, as are the A-T-rich states. Dotted 
lines indicate the presence of more states than could be 
comfortably shown. The presence of the 9 and 5 extra 
explicitly modelled states in the extended model reflects the 
expectation that there is more information in the binding sites. 
The rationale behind the self-looped states is to model length 
distributions between e.g. the signal state (Fig. 1) and the start 
of the —35 region. There are two more looped states in the 
normal model in order to compensate for the 9 explicit states 
in the extended model. If the length modelling differed 
markedly between the extended and normal model, one would 
risk situations where a submodel was preferred merely on the 
basis of length. The emission probabilities of the looped states 
are identical to the background state. 

Datasets. Only sequences from experimentally verified σA
promoters were used for training and testing. We obtained
these sequences from a list on John D. Helmann's worldwide
web page (http://www.bio.cornell.edu/microbio/helmann/helmann.html)
(Helmann, 1995) containing 236 σA-dependent
promoters. Using a subset of these, which have supporting
Fig. 2. A schematic drawing of the two σA-binding site submodels, the normal (upper) and the extended (lower). This
figure uses the same symbols as Fig. 1. Background states are white without letters, the -10, -35 and +1 regions are
indicated, the TG motif is hatched and other explicitly modelled states are indicated by letters. A dotted line between
two boxes indicates that the number of states in this region is greater than two. An arrow pointing at the dotted line
between two boxes symbolizes that the model allows a bypass of states. This allows the number of states between
the -10 and the -35 regions to vary from 16 to 21, and likewise from 4 to 10 between the -10 region and the +1 state.
Table 1. Bacterial strains and plasmids used in this study

Strain/plasmid | Genotype/description | Ref./source
---------------|----------------------|------------
B. subtilis | |
HH263 | trpC2 (168 wild-type) | C. Anagnostopoulos*
H0J1 | trpC2 amyE::pB1 | HH263/pB1, NeoR
H0J4 | trpC2 amyE::pB4 | HH263/pB4, NeoR
H0J5 | trpC2 amyE::pA1 | HH263/pA1, NeoR
H0J8 | trpC2 amyE::pA4 | HH263/pA4, NeoR
E. coli | |
MC1061 | F- araD139 Δ(ara-leu)7696 galK16 Δ(lac)X74 rpsL (StrR) hsdR2 (rK- mK+) mcrA mcrB | Stratagene
Plasmids | |
pDG268neo | ApR (E. coli), NeoR (B. subtilis); pBR322 derivative; vector used for integration of transcriptional lacZ fusions into the amyE gene | Saxild et al. (1996)
pB1 | ApR (E. coli), NeoR (B. subtilis); EcoRI-BamHI PCR fragment containing the wild-type promoter of rapB (260 bp, 23 bp upstream of the ORF) cloned into pDG268 | This work
pB4 | As pB1 with a base substitution G to C (in the extended -10 region, 42 bp upstream of the ORF) | This work
pA1 | ApR (E. coli), NeoR (B. subtilis); EcoRI-BamHI PCR fragment containing the wild-type promoter of rapA (93 bp, 27 bp upstream of the ORF) cloned into pDG268 | This work
pA4 | As pA1 with a base substitution G to A (in the extended -10 region, 43 bp upstream of the ORF) | This work

* C. Anagnostopoulos, INRA, Jouy en Josas, France.




experimental data, and which are labelled at the transcription
start sites (109), combined with some (11) determined in our
laboratory (H. H. Saxild, unpublished results) and some (10)
found in existing literature (Huang et al., 1998; Huang &
Helmann, 1998; Lewis et al., 1998; SubtiList, 1999; Zhang &
Begley, 1991), a list of 130 σA-dependent promoters was
constructed. The 130 sequences are 100 bp long and range
from approximately -85 to +15 relative to the transcription
start site.

Bacterial strains, plasmids and growth conditions. The
bacterial strains and plasmids used in this study are listed in
Table 1. Strains H0J1, H0J4, H0J5 and H0J8 have a single-
copy rap-lacZ transcriptional fusion inserted by a double-
crossover recombination event at the amyE locus, with and
without a base substitution in the extended -10 region. Cells
were grown at 37 °C as described previously (Saxild et al.,
1995). Spizizen salt-buffered minimal medium supplemented
with 100 µg L-tryptophan ml⁻¹ was used in the enzyme assay
and Luria-Bertani (LB) broth was used as rich medium. The
relevant antibiotics were used at the following concentrations:
neomycin, 5 µg ml⁻¹; ampicillin, 50 µg ml⁻¹.

DNA manipulations and genetic techniques. Chromosomal
and plasmid DNA was isolated as described previously by
Saxild et al. (1996). Treatment of DNA with restriction
enzymes and T4 DNA ligase was performed as recommended
by the supplier. Transformations of E. coli and B. subtilis were
performed as described previously by Saxild et al. (1996).
DNA sequencing was performed by the chain-termination
reaction method using dideoxyribonucleotides as described
by Sanger et al. (1977) using the Amersham Pharmacia Biotech
Thermo Sequenase radio-labelled termination cycle
sequencing kit. All sequencing was done with double-stranded
plasmid, and was performed as described by the supplier.
All PCRs were performed as described previously (Zeng &
Saxild, 1999). PCR product DNAs were isolated by the use
of GFX PCR DNA and gel band purification tubes from
Amersham Pharmacia Biotech.

Construction of clones. The promoter regions from rapA and
rapB were obtained by a PCR on chromosomal DNA from the
wild-type B. subtilis strain 168. In addition to the wild-type
promoter region, site-directed mutations were incorporated in
the extended -10 region by using PCR primers with
mismatches. The amplified promoter fragments with a 5'
EcoRI linker and a 3' BamHI linker were cloned in a
transcriptional fusion with the reporter gene lacZ using the
vector pDG268neo. The plasmid was amplified in E. coli
MC1061, linearized with KpnI and transformed into B. subtilis
HH263. The transcriptional fusion was integrated into the
amyE gene by a double-crossover event (Saxild et al., 1996).
All strains containing a transcriptional fusion were confirmed
by colony PCR with relevant primers, sequencing of the
cloned promoter region and verification of the AmyE-
phenotype, by screening for inability to produce clearing
zones on LB plates containing 1% starch. Primers complementary
to regions on each side of the cloned fusion were
used in PCRs to confirm that no double insertion had occurred.

Primer extension. RNA was isolated from H0J1 and H0J5 as
described by Saxild et al. (1995). The single-stranded DNA
primer (annealing just downstream of the BamHI cloning site
in pDG268) was radiolabelled at the 5' terminus using T4
polynucleotide kinase and [γ-33P]ATP. The primer extension
was performed by using the displayTHERMO-RT Reverse
Transcriptase kit from Display Systems Biotech. The radio-
labelled cDNA probes were separated on a 6% polyacrylamide
sequencing gel next to a sequencing of pB1/pA1
with the same primer, and visualized by autoradiography.

β-Galactosidase activity assay. Growing cells were harvested
by pouring 25-30 ml culture into a 50 ml centrifuge tube 1/3
full of ice, centrifuging at 7000 g for 5 min, washing with
10 ml of a 0.9% NaCl solution, centrifuging at 7000 g for
5 min, washing with 2 ml of a 0.9% NaCl solution, centrifuging
at 15000 g for 2 min, discarding the liquid phase, gently
adding 0.5 ml 30 mM phosphate buffer (pH 7.5), 1 mM EDTA
and 1 mM DTT (sonication buffer) without resuspending the
pellet, and storing at -20 °C. The total amount of protein was
determined by the Lowry method. The β-galactosidase activity
assay was performed using the method of Miller (1972).

RESULTS 

Fig. 3 shows the average performance of the trained
model on the test sets in the cross-validation experiment.
The true positive (TP) rate is the fraction of test
sequences which are predicted by the model to be σA
sites. Ideally, TP should be 1, which would correspond
to a sensitivity of 100%, but this is hardly ever feasible
without paying a price in terms of a high false positive
(FP) rate. The FP rate is the fraction of non-σA sites
which are nevertheless identified by the model to be σA
sites. Hence, ideally one wants FP equal to zero, in
which case the specificity of the model is 100%. Note
that in order to calculate the FP rate, one really needs a
set of sequences to which one is sure that σA does not
bind. Such a set is currently difficult if not impossible to
obtain, so in the absence of a better alternative we used,
for each of the 10 trained models, 1000 randomly
generated sequences of length 100 with a statistic equal
to 'background' statistics.
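The two rates can be computed from model scores as in the following sketch. The score lists are invented for illustration and do not come from the cross-validation experiment. The function name and the rule that a score strictly above the threshold counts as a prediction are assumptions.

```python
def tp_fp_rates(positive_scores, negative_scores, threshold):
    """Fraction of true sites and of negatives scoring above the threshold."""
    tp_rate = sum(s > threshold for s in positive_scores) / len(positive_scores)
    fp_rate = sum(s > threshold for s in negative_scores) / len(negative_scores)
    return tp_rate, fp_rate

# Hypothetical scores for ten known sites and ten random sequences.
known_sites = [0.95, 0.90, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
random_seqs = [0.05] * 9 + [0.75]

tp_rate, fp_rate = tp_fp_rates(known_sites, random_seqs, threshold=0.5)
```

Sweeping the threshold and plotting the resulting pairs gives a curve of the kind shown in Fig. 3.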

In most classification scenarios there is a trade-off
between sensitivity and specificity, and one has to find a
balance (by choosing a threshold) which is sensible for
the application in question. From Fig. 3 it is clear that
one can achieve a TP rate of about 0.7 with a very low
FP rate. Thus the probability threshold corresponding
to this sensitivity was our chosen threshold. Note that
the false negative rate at this threshold is 1 - TP = 0.3,
meaning that 30% of all true sites are not reported.

[Fig. 3 plot. Axis label: True positive (TP) rate.]

Fig. 3. The rate of false positives versus the rate of true
positives.

The predicted signals always contain the whole σA-
binding motif, and the expected transcription start site,
but are reported based on the score from the -10, rather
than the -35, sequence. As observed in E. coli, the -10
region of B. subtilis tends to occur unaccompanied by a
-35 signal (or accompanied by an extremely poor one)
in σA-binding promoters (Camacho & Salas, 1999),
though usually dependent on an activator (Lewis et al.,
2000). The converse is relatively rare, and presumably
such single -35 sites are not sufficient to bind σA and
should therefore not be counted as hits.

In order to estimate the number of FPs made on the 
entire genome we generated 100000 random sequences 
of length 100 with 'background' statistics and counted
the number of sequences scoring higher than the chosen 
cutoff. We did this three times and got 185, 199 and 225 
sequences respectively. We then simply assume that the 
mean of these numbers, 203, is the expected number of 
FPs made on 100000 candidates. In addition to the first- 
order Markov statistics used in 'background', we also 
tried generating the random sequences from kth-order
Markov chains for k = 0, 2 and 3. The number of 
sequences found on average in these cases was 791, 265 
and 270 respectively - i.e. they all performed worse than 
the chosen k = 1. 
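Generating random sequences from a first-order Markov chain can be sketched as follows. The initial and transition probabilities are hypothetical A/T-rich values chosen for illustration. In the procedure described above they would instead be estimated from the genome.

```python
import random

NUCLEOTIDES = "ACGT"
# Hypothetical parameters, not estimated from any genome.
INITIAL = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
TRANSITION = {
    "A": {"A": 0.35, "C": 0.15, "G": 0.20, "T": 0.30},
    "C": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "G": {"A": 0.30, "C": 0.25, "G": 0.15, "T": 0.30},
    "T": {"A": 0.25, "C": 0.20, "G": 0.20, "T": 0.35},
}

def sample_first_order(length, initial, transition, rng):
    """Sample a sequence in which each base depends only on the previous base."""
    seq = rng.choices(NUCLEOTIDES, weights=[initial[n] for n in NUCLEOTIDES])
    while len(seq) < length:
        prev = seq[-1]
        seq += rng.choices(NUCLEOTIDES, weights=[transition[prev][n] for n in NUCLEOTIDES])
    return "".join(seq)

example = sample_first_order(100, INITIAL, TRANSITION, random.Random(1))
```

Counting how many such sequences score above the cutoff gives an empirical false-positive count per batch.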

Note that it is conceivable that some of the high-scoring
random sequences would in fact bind σA in an experimental
setup and hence are not really FPs. Nevertheless,
this is our best estimate. In a genome of length
4.2 Mb the model is therefore expected to find roughly
(4.2 x 10^6 / 100) x (203 / 100000) = 85 FPs on each of the
positive and negative strands, making a total of 170.
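The arithmetic behind this estimate can be retraced with a short sketch. The figures are the ones quoted in the text (three counts per 100000 random sequences, a genome of 4.2 Mb and windows of length 100). The variable names are illustrative.

```python
counts = [185, 199, 225]           # high-scoring sequences per 100000 candidates
mean_count = sum(counts) / len(counts)
fp_per_candidate = mean_count / 100_000

genome_length = 4.2e6              # bp
window = 100                       # bp per candidate sequence
candidates_per_strand = genome_length / window

fps_per_strand = candidates_per_strand * fp_per_candidate
total_fps = 2 * round(fps_per_strand)   # both strands
```

This gives a mean of 203, about 85 expected false positives per strand and 170 in total.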

Using the HMM we predict that the entire genome of B. 
subtilis contains 2538 σA-binding sites. When examining
the list containing the reported results (1927 high- 
confidence predictions) we were able to locate 1127 of 
these within the 400 bp upstream regions of the 4228 
predicted genes in B. subtilis (SubtiList, 1999). Both 
these lists are available from the authors upon request. 

The model further predicts that approximately 50% of 
the predicted sites are of the 'extended' type, which is a 
little more than previous findings on smaller samples 
(45%) (Helmann, 1995). 

The sequence logos in Fig. 4 show the profiles of the
HMM predictions. From this it is clear that more
information exists in both types of σA-binding sites than
previously thought. It is especially clear that our model
has found that the transcription start site in B. subtilis
promoters dependent on σA is highly conserved. The
consensus sequence of this signal is 5'-YRTA-3' (+1 in
bold) in the normal type, and 5'-YRNA-3', where Y =
C/T, R = A/G and N = any nt, in the extended type. The
most frequent observed +1 signal in σA-binding
Fig. 4. Logos of the predicted σA-binding sites in B. subtilis. The logos shown are merged from six individual logos (the
merging positions are shown in the figure by horizontal bars), each containing either the -10 or the -35 signal from
either the normal (top logo) or the extended (bottom logo) type of σA recognition sites. Each of the -10 and the -35
signals represents approximately 500 signals predicted by the HMM. Each of the +1 logos is generated from 350
predicted signals. The logos are constructed by aligning the six types of signals on the first base of the reported signal.
For the +1 signal, this base is represented by the highest peak in that area, with an A on the top in both types of
binding site. The Shannon information content is shown on the y axis; Shannon's unit of non-randomness is the bit (short
for 'binary digit') (Shannon, 1948).



promoters is, according to these results, 5'-TATA-3'.
The signal in the +1 position is strongest in the extended
type, and here an A is much more frequent than a G.
This may be to compensate for the fact that the +2
position in this type of binding site is less conserved.
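The Shannon information content plotted in a sequence logo can be computed per alignment column as 2 bits minus the entropy of the observed nucleotide distribution. The sketch below assumes DNA with four equiprobable background nucleotides and applies no small-sample correction. The example columns are invented.

```python
import math

def information_content(column):
    """Information content (bits) of one alignment column of DNA bases."""
    n = len(column)
    entropy = 0.0
    for nt in "ACGT":
        count = column.count(nt)
        if count:
            p = count / n
            entropy -= p * math.log2(p)
    return 2.0 - entropy

fully_conserved = information_content("AAAA")   # one base only
two_bases = information_content("AATT")         # two bases, equally frequent
uniform = information_content("ACGT")           # all four bases equally frequent
```

A fully conserved position carries 2 bits, a position split evenly between two bases carries 1 bit, and a uniformly distributed position carries none.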

The model has, as expected, found that the extended
type of σA-binding sites has an A- and T-rich area
approximately 4 bp upstream of the -35 region, and
that the 3' end of the consensus of the -35 region in this
type of promoters is poorly conserved. The middle
section of the -10 region is likewise less conserved in
the promoters of the extended type when compared to
the normal type. The -16 motif is found to be 5'-
TNTG-3', which almost corresponds with the findings
in other Gram-positives and previous findings in B.
subtilis (Helmann, 1995; Voskuil & Chambliss, 1998;
Camacho & Salas, 1999). Both types of promoters seem
to have a slightly conserved tail with a length of 2 nt of
Ts and As following the -35 region, and likewise it is
found that the level of As is above average downstream
of the -10 region.

The model identified σA-binding sites in the expected
promoter regions of eight of the response-regulator 
aspartate phosphatase encoding genes (the rap genes). 
Our model predicts that rapB, -D, -E, -/ and -K are 



rapB  ATACATTATGATAAAATATAACC&A
rapA  TGTAAATATGATAAAATATGACATA
rapC  TATAAACATGATAAAATATGACATA
rapF  AATGTTGATGATAAAATATGACATA
rapJ  AACAGCTATGATAAAATATAACATA
rapD  AAAAGTTATGATATGATAATTATAG
rapH  TTTGGGATTGATAGAATATGACAT&
rapI  GGTTATTCTGACATAATACAATTAA

rapK  AATGACTATGTTATGATTGTTTTCG
rapE  CGAAAACTTGTTAATATTTACAGTA
rapG  GAAAGAGGTGTTACTATCAGAATAA



Fig. 5. A multiple alignment of the expected -10 region of the
rap genes. Fully conserved residues are shown in bold. The TG
motif is observed in all the rap genes. The positions of
transcription start are shown by underlining the first base of
the transcript. The transcription starts indicated for rapA and
rapD are experimentally verified (Mueller et al., 1992; Huang &
Helmann, 1998), the rest are solely predicted by the HMM,
having a score higher than the cutoff used.



transcribed from a σA-dependent promoter using the
extended σA-binding site, and that rapF, -G and -H have
the normal σA-binding site.

In Fig. 5 the expected -10 regions of the rap genes are
aligned. There appears to be a highly conserved consensus
containing a TG motif immediately upstream of
a -10 motif in all the rap genes, which at least implies
that this group of genes is transcribed from a
promoter containing an extended -10 region.

Table 2. β-Galactosidase activity

Strain | Relevant genotype | β-Galactosidase activity (±SD)*
-------|-------------------|--------------------------------
H0J1 | Wild-type | 45 (±18)
H0J4 | G42-C | 4.1 (±2.3)
H0J5 | Wild-type | 710 (±290)
H0J8 | G43-A | 170 (±71)

*The β-galactosidase activities are reported as the mean (± standard
deviation) of eight independent measurements.


The aligned putative extended -10 region for rapF in
Fig. 5 is not the one predicted by the model. The model
predicts a -10 region 10 bp further downstream. The
sequence shown in Fig. 5, however, aligns with the
experimentally verified -10 region in the promoter
region of rapA and -D. We suggest that rapF might
utilize both putative σA-binding sites.

Experimental verification of predicted extended sites 

We tested these predictions by site-directed mutagenesis
of the extended region within the predicted σA-binding
site. The site-directed mutagenesis (see Table 2) indeed
showed a decrease in transcription for both rapA and
rapB throughout a sporulation experiment (rapA, -B
and -E are known to play a role in the phosphorelay
signal-transduction system of sporulation: Mueller et
al., 1992; Jiang et al., 2000), confirming that this region
is necessary for transcription. When the G in the TG
motif in the promoter region of rapB is substituted with
a C, the amount of transcript drops on average 10-fold
in the sporulation experiment. Likewise, when the
corresponding G upstream of rapA is substituted with
an A, the level of transcription drops approximately
fourfold.
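The fold changes quoted above can be retraced from the mean activities in Table 2. The sketch assumes the table values read as 45 and 4.1 for the rapB fusions (H0J1, H0J4) and 710 and 170 for the rapA fusions (H0J5, H0J8). It ignores the standard deviations, so it says nothing about statistical significance.

```python
# Mean β-galactosidase activities as read from Table 2 (assumed values).
activity = {"H0J1": 45.0, "H0J4": 4.1, "H0J5": 710.0, "H0J8": 170.0}

fold_rapB = activity["H0J1"] / activity["H0J4"]   # wild-type vs G-to-C mutant
fold_rapA = activity["H0J5"] / activity["H0J8"]   # wild-type vs G-to-A mutant
```

The ratios come out at roughly 11 for rapB and roughly 4.2 for rapA, in line with the 10-fold and fourfold drops described.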

Primer extension of rapA confirmed two previously 
mapped transcription start sites (1 bp apart) (see Fig. 5) 
(Mueller et al., 1992). For rapB, we were unable to detect
any clear signal in repeated experiments, presumably 
due to the lower expression level of this gene, to 
instability of the messenger, or to both. 

DISCUSSION 

By using the HMM-based prediction tool we have 
constructed, we are able to predict that the genome of B. 
subtilis contains roughly 2538 σA-binding sites. We have
generated a list containing 1127 binding sites, which are 
located within the 400 bp sequences upstream of pre- 
dicted genes. By examining Fig. 3 it is clear that the 
constructed model can predict almost 70% of the true 
sites, virtually without predicting sites that do not
actually bind the factor. It is also clear that the model
can be used to predict an even larger percentage of the
true sites with a low level of false positives (FPs). The
model would predict only 1% FPs when predicting 83%
of all true binding sites, or 7% when predicting 94%. In
cases where a rate of FPs of almost 22% is acceptable,
all true binding sites would theoretically be predicted.

When using the chosen cutoff, we are unable to locate
approximately 30% of the true σA-binding sites. These
false negatives are binding sites that in a variety of ways
differ from the average σA-binding sites. One example of
true σA-binding sites that this prediction tool has
difficulties in finding are the Spo0A-activated promoters.
These promoters are known to have one or several 0A
boxes at or near the -35 region, where Spo0A~P
binds and activates transcription. This binding abolishes
the negative effect of not only the poorly conserved -35
regions, but also the exceptionally large separation
between the -10 and the -35 region (more than
21 bp), which promoters of this type are known to have
(Lewis et al., 2000). These sites do not fit the model due
to the fact that the model only allows a spacing of
16-21 bp. Despite this drawback, we chose to accept
this restriction because it gave rise to the model with the
best overall performance.

The large spacing and poorly conserved -35 regions,
which are often observed in activator-dependent
promoters, could explain why the model does not find
any true σA-binding sites in either rapA, rapC, rapE, or
the putative second site in rapF, though there apparently
exists a strong signal for an extended -10 region in this
group of genes. rapA, rapC and rapE are known to be
activated by the binding of ComA~P to a ComA box
upstream of the -35 region (Mueller et al., 1992;
Lazazzera et al., 1999; Jiang et al., 2000). When the
expected promoter regions of the rap genes are aligned,
it appears that rapF has a ComA box consensus site at
the same position as rapA, rapC and rapE, which
strongly suggests that expression of rapF is also dependent
on ComA~P (alignment not shown).

In Fig. 4 it is observed that the HMM has found a highly
conserved signal at the +1 position. It appears that the
site of initiation of transcription in σA-dependent
promoters is separated from the -10 Pribnow box by
on average 7 bp and has the consensus 5'-pyrimidine-
purine-T-A/C-3' (most frequent: 5'-TATA-3'), with
transcription starting at the purine. This corresponds
with findings in E. coli, where the initiation site in σ70-
dependent promoters is -purine-pyrimidine- (Rosenberg
& Court, 1979; Pedersen & Engelbrecht, 1995).

From this work, it is suggested that the σA-binding sites
classified as extended are significantly different from
normal σA-binding sites in two areas of the promoter
sequence. These differences are the -16 motif and the
four bases approximately 40 nt upstream of the initiation
site, which seem to be rich in A and T. The
extended type has likewise been found to be less
conserved at position 5 of the -35 region (the C in
TTGACA) and at the +2 position in the +1 motif.
In conclusion, we have constructed an HMM that has
identified σA-binding sites in B. subtilis with known
sensitivity and specificity. We have estimated the total
number of σA-binding sites to be around 2538, and found
the ratio between extended and normal -10 regions of
σA-binding sites to be around 1:1. To support these
findings we have experimentally verified that two of the
predicted promoters indeed depend on an extended type
of σA-binding site.

The trained HMM is available from the authors upon 
request. The list of predictions from the trained HMM 
is available as supplementary data with the online 
version of this paper at http://mic.sgmjournals.org. 